SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering
نویسندگان
چکیده
Classification datasets often have an unequal class distribution among their examples. This problem is known as imbalanced classification. The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most well-know data pre-processing methods to cope with it and tobalance thedifferentnumberof examples of eachclass.However, as recentworks claim, class imbalance is not a problem in itself and performance degradation is also associatedwith other factors related to the distribution of the data. One of these is the presence of noisy and borderline examples, the latter lying in the areas surrounding class boundaries. Certain intrinsic limitations of SMOTE canaggravate the problemproduced by these types of examples and current generalizations of SMOTE are not correctly adapted to their treatment. This paper proposes the extension of SMOTE through a new element, an iterative ensemblebased noise filter called Iterative-Partitioning Filter (IPF), which can overcome the problems produced by noisy and borderline examples in imbalanced datasets. This extension results in SMOTE–IPF. The properties of this proposal are discussed in a comprehensive experimental study. It is compared against a basic SMOTE and its most well-known generalizations. The experiments are carried out both on a set of synthetic datasets with different levels of noise and shapes of borderline examples as well as real-world datasets. Furthermore, the impact of introducing additional different types and levels of noise into these real-world data is studied. The results show that the newproposal performs better than existing SMOTE generalizations for all these different scenarios. The analysis of these results also helps to identify the characteristics of IPF which differentiate it from other filtering approaches. 2014 Elsevier Inc. All rights reserved.
منابع مشابه
Managing Borderline and Noisy Examples in Imbalanced Classification by Combining SMOTE with Ensemble Filtering
Imbalance data constitutes a great difficulty for most algorithms learning classifiers. However, as recent works claim, class imbalance is not a problem in itself and performance degradation is also associated with other factors related to the distribution of the data as the presence of noisy and borderline examples in the areas surrounding class boundaries. This contribution proposes to extend...
متن کاملBorderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
In recent years, mining with imbalanced data sets receives more and more attentions in both theoretical and practical aspects. This paper introduces the importance of imbalanced data sets and their broad application domains in data mining, and then summarizes the evaluation metrics and the existing methods to evaluate and solve the imbalance problem. Synthetic minority oversampling technique (S...
متن کاملSelecting Minority Examples from Misclassified Data for Over-Sampling
We introduce a method to deal with the problem of learning from imbalanced data sets, where examples of one class significantly outnumber examples of other classes. Our method selects minority examples from misclassified data given by an ensemble of classifiers. Then, these instances are over-sampled to create new synthetic examples using a variant of the well-known SMOTE algorithm. To build th...
متن کاملEnhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining
This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...
متن کاملPreprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection
The Synthetic Minority Over Sampling TEchnique (SMOTE) is a widely used technique to balance imbalanced data. In this paper we focus on improving SMOTE in the presence of class noise. Many improvements of SMOTE have been proposed, mostly cleaning or improving the data after applying SMOTE. Our approach differs from these approaches by the fact that it cleans the data before applying SMOTE, such...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Inf. Sci.
دوره 291 شماره
صفحات -
تاریخ انتشار 2015